**Assignment 4**

Student name: Kelvin Shen

(#) Overview

In this assignment, we implement neural style transfer, which re-renders specific content in a given artistic style. The algorithm takes a content image, a style image, and an input image. The input image is optimized to match the two target images in content space and style space. We break this down into two steps: content reconstruction and texture synthesis.

__Note: Bells and Whistles are completed and are demonstrated at the bottom.__

(#) Part 1: Content Reconstruction
Target Content Image
In this part, we implement the content-space loss and optimize a random noise image with respect to the content loss only. The content loss is simply the L2 distance between one layer's features of the input image and the corresponding features of the target content image. To obtain image features, we use a pretrained VGG-19 network, which consists of 5 convolutional blocks (2 conv layers each in the 1st and 2nd blocks, 4 conv layers each in the remaining 3 blocks). In the following table, I ablate the specific layers after which the content loss is inserted, incrementally appending a content loss after every block, which corresponds to conv_2, conv_4, conv_8, conv_12, and conv_16. A minimal sketch of the content loss module follows the table.

| Layers to Insert Content Loss | Noise 1 | Noise 2 |
| :---: | :---------------: | :---------------: |
| __conv_2__ | | |
| conv_2, conv_4 | | |
| conv_2, conv_4, conv_8 | | |
| conv_2, conv_4, conv_8, conv_12 | | |
| conv_2, conv_4, conv_8, conv_12, conv_16 | | |

From the ablation study, the best place to insert the content loss is after the second conv layer (conv_2) only. Inserting a content loss after every convolutional block does not improve the reconstruction quality. This is because different blocks capture different levels of abstraction: earlier blocks learn edges and simple features, whereas later blocks learn more general and complex patterns. Consequently, if we do insert content losses after several different blocks and backpropagate them to the input image, we should weight the blocks appropriately instead of treating them all equally with the same loss magnitude.
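Concretely, the content loss can be implemented as a small module spliced into the VGG-19 feature extractor right after the chosen conv layer. Below is a minimal sketch of this idea in PyTorch (class and variable names are my own, not necessarily those in the starter code):

```python
import torch.nn as nn
import torch.nn.functional as F

class ContentLoss(nn.Module):
    """Records the MSE between the current feature map and a fixed target feature map.

    Inserted after a chosen VGG-19 conv layer; it behaves as a transparent layer
    that stores its loss on every forward pass.
    """
    def __init__(self, target_feature):
        super().__init__()
        # Detach the target so it is treated as a constant, not something to optimize.
        self.target = target_feature.detach()

    def forward(self, x):
        self.loss = F.mse_loss(x, self.target)
        return x  # pass the features through unchanged
```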
(#) Part 2: Texture Synthesis

Target Style Image
In this part, we implement the style-space loss, which uses the Gram matrix to measure the distance between the styles of two images. The Gram matrix is a mathematical representation of the correlations between the feature maps of a CNN layer. Similar to the previous part, I ablate the specific layers after which the style loss is inserted: after each of the first 5 conv layers, each of the middle 5 conv layers, each of the last 5 conv layers, and after each conv block. A minimal sketch of the Gram matrix and the style loss follows the table.

| Layers to Insert Style Loss | Noise 1 | Noise 2 |
| :---: | :---------------: | :---------------: |
| __conv_1, conv_2, conv_3, conv_4, conv_5__ | | |
| conv_6, conv_7, conv_8, conv_9, conv_10 | | |
| conv_11, conv_12, conv_13, conv_14, conv_15 | | |
| conv_2, conv_4, conv_8, conv_12, conv_16 | | |

From the ablation study, the best place to insert the style loss is after each of the first 5 conv layers. Inserting the style loss after conv_11 through conv_15 does not synthesize the target texture at all, because later convolutional blocks capture more abstract and general patterns, while the target texture is very detailed and concrete (full of edges, corners, etc.) and must be captured by earlier blocks. Intuitively, the last row of the ablation can be seen as an average of how well the early and late blocks reproduce the target texture. Therefore, we choose the first configuration.
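Concretely, the style loss compares Gram matrices of feature maps. Below is a minimal PyTorch sketch of the Gram matrix and the style-loss module (again, the names are my own):

```python
import torch.nn as nn
import torch.nn.functional as F

def gram_matrix(features):
    """Normalized Gram matrix of a (batch, channels, height, width) feature tensor."""
    b, c, h, w = features.size()
    flat = features.view(b * c, h * w)      # one row per feature map
    gram = flat @ flat.t()                  # channel-by-channel correlations
    return gram / (b * c * h * w)           # normalize by the number of elements

class StyleLoss(nn.Module):
    """Stores the MSE between the current Gram matrix and the target Gram matrix."""
    def __init__(self, target_feature):
        super().__init__()
        self.target = gram_matrix(target_feature).detach()

    def forward(self, x):
        self.loss = F.mse_loss(gram_matrix(x), self.target)
        return x  # pass features through unchanged
```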
(#) Part 3: Style Transfer

In this part, we combine content reconstruction and texture synthesis to finally achieve neural style transfer. We provide both the target content image and the target style image to the network, and optimize either a random noise image (just as before) or the content image directly.

(##) Hyper-parameters Tuning

Content Image Style Image
We first ablate the hyper-parameters, specifically $\lambda_{style}$ and $\lambda_{content}$: the weights multiplied with the total style loss and the total content loss, respectively. Intuitively, this pair of weights controls whether the optimization focuses more on reconstructing the content image or on transferring the style. A sketch of how the weighted losses drive the optimization follows the table.

| Weights | Input: Noise | Input: Content Image |
| :--- | :---------------: | :---------------: |
| $\lambda_{style}=10^3$, $\lambda_{content}=1$ | | |
| $\lambda_{style}=10^5$, $\lambda_{content}=1$ | | |
| $\lambda_{style}=10^7$, $\lambda_{content}=1$ | | |

We find that $\lambda_{style}=10^5$, $\lambda_{content}=1$ gives the best trade-off between content reconstruction and texture synthesis.
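For reference, below is a minimal sketch of the optimization loop, assuming a `model` with the `ContentLoss`/`StyleLoss` modules from the earlier sketches spliced in, and using LBFGS, a common choice for this kind of image optimization. The function name and argument layout are hypothetical, not necessarily the starter code's API:

```python
import torch

def run_style_transfer(model, content_losses, style_losses, input_img,
                       num_steps=300, style_weight=1e5, content_weight=1):
    """Optimize input_img so that the weighted content + style losses are minimized.

    `model` is the VGG-19 feature extractor with ContentLoss / StyleLoss modules
    inserted after the chosen conv layers (see the sketches above).
    """
    input_img = input_img.clone().requires_grad_(True)   # optimize pixels, not weights
    optimizer = torch.optim.LBFGS([input_img])

    step = [0]
    while step[0] < num_steps:
        def closure():
            with torch.no_grad():
                input_img.clamp_(0, 1)                   # keep pixels in a valid range
            optimizer.zero_grad()
            model(input_img)                             # populates .loss on each loss module
            style_score = sum(sl.loss for sl in style_losses)
            content_score = sum(cl.loss for cl in content_losses)
            loss = style_weight * style_score + content_weight * content_score
            loss.backward()
            step[0] += 1
            return loss
        optimizer.step(closure)

    with torch.no_grad():
        input_img.clamp_(0, 1)
    return input_img.detach()
```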
(##) Results

Here I optimize either the same random noise or the content image directly, and show the results below. I record the time of the optimization given a fixed number of iterations (i.e. $300$) across all experiments.

(###) Optimize random noise

| | Style 1 | Style 2 |
| :-----------: | :-----------: | :-----------: |
| Content 1 | time: 13.22 sec | time: 13.24 sec |
| Content 2 | time: 10.69 sec | time: 10.74 sec |
(###) Optimize content image directly

| | Style 1 | Style 2 |
| :-----------: | :-----------: | :-----------: |
| Content 1 | time: 12.09 sec | time: 12.02 sec |
| Content 2 | time: 9.52 sec | time: 9.46 sec |

We can see that optimizing the content image directly is strictly faster than optimizing the random noise. Quantitatively, the former is 9.23%, 11.93%, 8.59%, and 10.95% faster than the latter in the four style-content combinations above. In addition, optimizing the content image directly yields much better style transfer quality. In particular, optimizing random noise would take many more iterations to produce an equally good result, because the content image serves as a good initialization; a small sketch of the two initializations follows.
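For concreteness, the two settings differ only in the image handed to the optimizer. This reuses the hypothetical `run_style_transfer` sketch from the hyper-parameter section; `model`, the loss lists, and `content_img` are assumed to be set up as before:

```python
import torch

# Noise initialization: start from random pixels with the same shape as the content image.
noise_init = torch.randn_like(content_img)
output_from_noise = run_style_transfer(model, content_losses, style_losses, noise_init)

# Content initialization: start from the content image itself (faster and higher quality here).
output_from_content = run_style_transfer(model, content_losses, style_losses, content_img)
```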
Finally, I show two results on some of my favorite images, using the best configuration observed in the previous parts: content loss after `conv_2`, style loss after `conv_1, conv_2, conv_3, conv_4, conv_5`, $\lambda_{style}=10^5$, $\lambda_{content}=1$, and optimizing the content image directly.

| Content Image | Style Image | Output |
| :-----------: | :-----------: | :-----------: |
| | | |
| | | |

The first row takes 12.78 sec, which is 44.88% faster than optimizing the random noise. The second row takes 9.52 sec, which is 11.17% faster than optimizing the random noise.

(#) Bells & Whistles

(##) Stylize Poisson blended images from the previous homework

I use my Poisson blended images generated in HW2 here as content images.

Content Image Style Image Output
(##) Apply style transfer to a video

I naively apply neural style transfer to a video frame by frame. Since we do not enforce any temporal constraint, the output video has a lot of temporal flickering artifacts. One way to reduce this is to enforce temporal smoothness during optimization; see the sketch after the video below.
Content Video Style Image Output
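Below is a minimal sketch of the naive per-frame loop, with a cheap warm-start heuristic that initializes each frame from the previous stylized output. Here `style_transfer_fn` is a hypothetical wrapper around the per-image optimization from Part 3; a real fix would add an explicit temporal-consistency term (e.g. along optical-flow correspondences) to the objective:

```python
def stylize_video(frames, style_img, style_transfer_fn):
    """Naive frame-by-frame stylization with a warm-start heuristic.

    `frames` is a list of (1, 3, H, W) tensors and `style_transfer_fn(content,
    style, init)` is assumed to wrap the per-image optimization from Part 3.
    Warm-starting each frame from the previous stylized frame mildly reduces
    flickering but does not enforce true temporal consistency.
    """
    outputs, prev = [], None
    for frame in frames:
        init = frame.clone() if prev is None else prev.clone()
        out = style_transfer_fn(frame, style_img, init)
        outputs.append(out)
        prev = out
    return outputs
```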
(##) Use a feedforward network to output style transfer results directly

I reference the open-source implementation of [Perceptual Losses for Real-Time Style Transfer](https://arxiv.org/abs/1603.08155), which can be found [here](https://github.com/tyui592/Perceptual_loss_for_real_time_style_transfer). This work uses a feedforward network to transform an input image according to a particular style. Unlike what we implemented above, where the network is kept fixed and the input is optimized directly, this work optimizes the network weights so that the network's output matches the content image in content and the target style in style. Therefore, during inference one only needs to provide a content image, because the style is already encoded in the weights of the network. I use the pretrained models provided by the author and generate results on our content images.
| | Style 1 | Style 2 |
| :-----------: | :-----------: | :-----------: |
| Content 1 | | |
| Content 2 | | |

(##) Automatically match the style image dimension with the content image dimension

I implement `resize_and_crop_style` in `utils.py`, which automatically resizes and crops the style image to match the dimensions of the content image while preserving the style image's aspect ratio. It compares the aspect ratios of the style and content images, then applies `transforms.functional.resize` and `transforms.functional.center_crop` to the style image. A minimal sketch of this idea follows.
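The sketch below is my reading of the description above rather than the exact code in `utils.py`:

```python
import torchvision.transforms.functional as TF

def resize_and_crop_style(style_img, content_img):
    """Resize then center-crop the style image to the content image's size.

    Both images are (C, H, W) or (B, C, H, W) tensors. The style image is first
    resized so that it covers the content image's height and width (preserving
    its aspect ratio), then center-cropped to the exact content dimensions.
    """
    content_h, content_w = content_img.shape[-2:]
    style_h, style_w = style_img.shape[-2:]

    # Scale so both sides of the style image are at least as large as the
    # corresponding content sides, keeping the style image's aspect ratio.
    scale = max(content_h / style_h, content_w / style_w)
    new_h, new_w = int(round(style_h * scale)), int(round(style_w * scale))

    style_img = TF.resize(style_img, [new_h, new_w])
    style_img = TF.center_crop(style_img, [content_h, content_w])
    return style_img
```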